Extracting News Web Page Creation Time with DCTFinder

Author

  • Xavier Tannier
Abstract

Web pages do not offer reliable metadata about their creation date and time. However, obtaining the document creation time is a necessary step before temporal normalization systems can be applied to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts its title and creation date from its content. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Using such a system enables deeper and more efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87 and 92%). DCTFinder is freely available at http://sourceforge.net/projects/dctfinder/, along with all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself.
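To make the last stage of the pipeline more concrete, here is a minimal Python sketch of rule-based date recognition: it takes a text span that an earlier step has flagged as a likely date line and normalizes it to a calendar date. This is an illustration only, not DCTFinder's actual rule set. The month lexicon, the three regular expressions, and the names `normalize_date` and `_safe_date` are all assumptions introduced for this example.

```python
import re
from datetime import date

# Hypothetical month lexicon (English and French); DCTFinder's real
# vocabulary is distributed with the tool and is not reproduced here.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4, "may": 5, "june": 6,
    "july": 7, "august": 8, "september": 9, "october": 10, "november": 11,
    "december": 12,
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10, "novembre": 11,
    "décembre": 12,
}

# Assumed patterns: ISO dates, "12 March 2014" / "1er février 2014",
# and "March 12, 2014".
ISO_RE = re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)")
DMY_RE = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th|er)?\s+([^\W\d_]+)\s+(\d{4})\b")
MDY_RE = re.compile(r"\b([^\W\d_]+)\s+(\d{1,2})(?:st|nd|rd|th)?,?\s+(\d{4})\b")


def _safe_date(year, month, day):
    """Build a date, returning None for impossible values such as 31 February."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def normalize_date(text):
    """Return the first valid date found in text, or None if no rule applies."""
    lowered = text.lower()
    for m in ISO_RE.finditer(lowered):
        found = _safe_date(int(m[1]), int(m[2]), int(m[3]))
        if found:
            return found
    for m in DMY_RE.finditer(lowered):
        if m[2] in MONTHS:
            found = _safe_date(int(m[3]), MONTHS[m[2]], int(m[1]))
            if found:
                return found
    for m in MDY_RE.finditer(lowered):
        if m[1] in MONTHS:
            found = _safe_date(int(m[3]), MONTHS[m[1]], int(m[2]))
            if found:
                return found
    return None


if __name__ == "__main__":
    for line in ["Published 12 March 2014 by staff",
                 "Posted on March 12, 2014",
                 "Publié le 1er février 2014",
                 "2014-03-12T08:30"]:
        print(line, "->", normalize_date(line))
```

In the system described above, deciding which line of the page carries the creation date is the job of the CRF model; a rule layer like this one would only run on the span that model selects.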

Similar resources

Incremental Web Search: Tracking Changes in the Web

A large amount of new information is posted on the Web every day. Large-scale web search engines often update their indexes slowly and are unable to present such information in a timely manner. In this thesis, we present our solutions for finding new information on the web by tracking changes to web documents. First, we present the algorithms and techniques useful for solving the following...

Automatic Extraction of Event Information from Newspaper Articles and Web Pages

In this paper, we propose a method for extracting travel-related event information, such as an event name or a schedule, from automatically identified newspaper articles in which particular events are mentioned. We analyze news corpora using our method, extracting venue names from them. We then find web pages that refer to event schedules for these venues. To confirm the effectiveness of our met...

Template-Independent News Extraction Based on Visual Consistency

Wrappers are a traditional method for extracting useful information from Web pages. Most previous works rely on the similarity between HTML tag trees and induce template-dependent wrappers. When hundreds of information sources need to be extracted in a specific domain like news, it is costly to generate and maintain the wrappers. In this paper, we propose a novel template-independent news extraction ...

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Beyond its own merits, text classification (TC) has become a cornerstone of many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A Temporal Relation-based Method for Extracting Occurrence Time in Chinese Web News of Emergencies

Temporal relation theory is used to extract the occurrence time of emergencies from web news. The Chinese temporal relation is first introduced. Then expressive features are summarized from the composition, location and commonly used words of temporal information. The overlapping method of extracting occurrence time in web news is proposed and implemented in a system. The experiments b...

Publication date: 2014